The article investigates why Transformers struggle with multi-digit multiplication, revealing that although these models are capable of representing the necessary long-range dependencies, standard training often converges to local optima that fail to exploit them. The authors propose an auxiliary loss that improves the learning dynamics, enabling Transformers to learn these long-range dependencies.
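The summary does not spell out the form of the auxiliary loss, but the general pattern it refers to is adding a weighted secondary objective on top of the main next-token loss. Below is a minimal PyTorch sketch of that pattern, not the authors' exact method: the extra head (`aux_head`), its target (some intermediate quantity such as a partial result), and the weight `aux_weight` are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class TransformerWithAuxHead(nn.Module):
    """Small Transformer with an extra head for auxiliary supervision (illustrative only)."""

    def __init__(self, vocab_size: int, d_model: int = 128, n_aux_classes: int = 10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)       # main next-token prediction head
        self.aux_head = nn.Linear(d_model, n_aux_classes)   # hypothetical auxiliary head

    def forward(self, tokens: torch.Tensor):
        h = self.backbone(self.embed(tokens))
        return self.lm_head(h), self.aux_head(h)

def combined_loss(lm_logits, aux_logits, lm_targets, aux_targets, aux_weight: float = 0.5):
    """Main cross-entropy plus a weighted auxiliary cross-entropy term."""
    ce = nn.functional.cross_entropy
    main = ce(lm_logits.flatten(0, 1), lm_targets.flatten())
    aux = ce(aux_logits.flatten(0, 1), aux_targets.flatten())
    return main + aux_weight * aux
```

The idea is that the auxiliary target gives the model a direct training signal for intermediate quantities that depend on distant digits, steering optimization away from local optima that ignore those long-range dependencies.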